{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "57d67ef4-bf3e-46c1-8dfa-d6186a20ced9",
   "metadata": {},
   "source": [
    "# Using TensorRT-LLM to Run Phi-3-mini Models with LoRA Fine-tuned Models"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce4d2142-d326-47e2-90ea-dc8b1954f8d3",
   "metadata": {},
   "source": [
    "In this notebook you'll learn how to use NVIDIA's [TensorRT-LLM](https://developer.nvidia.com/tensorrt#section-inference-for-llms) to run a Phi-3-mini base model along with a task-specific Phi-3 model fine-tuned using Low-rank Adaptation (LoRA) technique. LoRA is a parameter efficient fine-tuning (PEFT) technique that introduces low-rank matrices into each layer of the LLM architecture, and only trains these matrices while keeping the original LLM weights frozen. It is one of the several LLM customization methods supported in NVIDIA [NeMo](https://www.nvidia.com/en-us/ai-data-science/generative-ai/nemo-framework/) described in this [blog](https://developer.nvidia.com/blog/selecting-large-language-model-customization-techniques).\n",
    "\n",
    "TensorRT-LLM is an open-source library that accelerates LLM inference performance on NVIDIA GPUs. NeMo is an end-to-end framework for building, customizing, and deploying generative AI applications. You can immediately try Phi-3 models through a browser user interface. Or, through API endpoints running on a fully accelerated NVIDIA stack from the [NVIDIA API catalog](http://build.nvidia.com/), where each model in Phi-3 is packaged as an [NVIDIA NIM](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/) with a standard API that can be deployed anywhere.  \n",
    "\n",
    "To accelerate inferencing and offer state-of-the-art performance on NVIDIA GPUs, TensorRT-LLM compiles the models into TensorRT engines, from model layers into optimized CUDA kernels using [pattern matching and fusion](https://nvidia.github.io/TensorRT-LLM/architecture/core-concepts.html#pattern-matching-and-fusion). Those engines are executed by the TensorRT-LLM runtime which includes several advanced optimizations such as [in-flight batching](https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#in-flight-batching), [KV caching](https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html#paged-kv-cache) and [quantization](https://nvidia.github.io/TensorRT-LLM/reference/precision.html) to support lower precision workloads. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "171bc898-878c-4ff8-99c9-42f843051b42",
   "metadata": {},
   "source": [
    "## Initial Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6eac80f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "!sudo apt-get update && apt-get -y install openmpi-bin libopenmpi-dev git git-lfs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8238fd96-1b88-4138-b0f8-3c4b190596d8",
   "metadata": {},
   "source": [
    "## Install TensorRT-LLM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "04162a0a-f94e-47ae-a3ed-6107bfb78ca7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
      "\u001b[0m\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
      "\u001b[0m"
     ]
    }
   ],
   "source": [
    "!pip install -q ipywidgets\n",
    "!pip install tensorrt_llm -U -q --extra-index-url https://pypi.nvidia.com"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22ff5c4b-0ed9-4f7a-bc2c-72072e97aa37",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/run.py -P .\n",
    "!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/utils.py -P .\n",
    "!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/quantization/quantize.py -P ."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5141ba92-5385-4cb0-b400-8ffe73bd5354",
   "metadata": {},
   "source": [
    "TensorRT-LLM supports running Phi-3-mini/small models with FP16/BF16/FP32 LoRA. In this notebook, we'll use Phi-3-mini as an example to show how to run an FP8 base model with FP16 LoRA module.\n",
    "\n",
    "- download the base model and lora model from Hugging Face"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ec6671f-ea6c-4cdd-8a41-5de8d2a1dd82",
   "metadata": {},
   "outputs": [],
   "source": [
    "!git lfs install\n",
    "!git-lfs clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct\n",
    "!git-lfs clone https://huggingface.co/sikoraaxd/Phi-3-mini-4k-instruct-ru-lora"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3254dd0-3d9d-45e5-80ca-aaeabbf33632",
   "metadata": {},
   "source": [
    "## Quantize Model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c69428be-e2a1-4869-91b3-87e57df34dd3",
   "metadata": {},
   "source": [
    "Now let's quantize the Phi-3-mini base model from Hugging Face to FP8 creating smaller model with lower memory footprint without sacrificing accuracy. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d0e72d0-b3d9-4726-a8b3-6909c8577824",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quantize the base model\n",
    "\n",
    "#BASE_PHI_3_MINI_MODEL=./Phi-3-mini-4k-instruct\n",
    "!python3 quantize.py --model_dir ./Phi-3-mini-4k-instruct \\\n",
    "                                   --dtype float16 \\\n",
    "                                   --qformat fp8 \\\n",
    "                                   --kv_cache_dtype fp8 \\\n",
    "                                   --output_dir phi3_mini_4k_instruct/trt_ckpt/fp8/1-gpu \\\n",
    "                                   --calib_size 512"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b231a302-4bda-44ca-b754-66d1e9173794",
   "metadata": {},
   "source": [
    "## Build Optimized TensorRT Engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5521f8b-afd6-459b-a96f-28083ebfbde9",
   "metadata": {},
   "outputs": [],
   "source": [
    "Next we build the TensorRT engine for the base model specifying the lora model as a config parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5af5e1db-b175-4151-b260-6bd9bfcfa98d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[TensorRT-LLM] TensorRT-LLM version: 0.11.0\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set bert_attention_plugin to auto.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set gpt_attention_plugin to auto.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set gemm_plugin to auto.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set nccl_plugin to auto.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set lookup_plugin to None.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set lora_plugin to auto.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set moe_plugin to auto.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set context_fmha to True.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set paged_kv_cache to True.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set remove_input_padding to True.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set use_custom_all_reduce to True.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set reduce_fusion to False.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set multi_block_mode to False.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set enable_xqa to True.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set tokens_per_block to 64.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set use_paged_context_fmha to False.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set use_fp8_context_fmha to False.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set multiple_profiles to False.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set paged_state to True.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [I] Set streamingllm to False.\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.producer = {'name': 'modelopt', 'version': '0.13.1'}\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.residual_mlp = False\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.bias = False\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rotary_pct = 1.0\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rank = 0\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.decoder = phi3\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rmsnorm = True\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.lm_head_bias = False\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rotary_base = 10000.0\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.original_max_position_embeddings = 4096\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rotary_scaling = None\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. \n",
      "\n",
      "[07/21/2024-06:50:42] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored\n",
      "[07/21/2024-06:50:43] [TRT-LLM] [I] Set dtype to float16.\n",
      "[07/21/2024-06:50:45] [TRT] [I] [MemUsageChange] Init CUDA: CPU +16, GPU +0, now: CPU 241, GPU 457 (MiB)\n",
      "[07/21/2024-06:50:48] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +4497, GPU +1242, now: CPU 4885, GPU 1699 (MiB)\n",
      "[07/21/2024-06:50:48] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.\n",
      "[07/21/2024-06:50:48] [TRT-LLM] [I] Set nccl_plugin to None.\n",
      "[07/21/2024-06:50:48] [TRT-LLM] [I] Set use_custom_all_reduce to True.\n",
      "[07/21/2024-06:50:49] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0\n",
      "[07/21/2024-06:50:49] [TRT] [W] Unused Input: position_ids\n",
      "[07/21/2024-06:50:49] [TRT] [W] Detected layernorm nodes in FP16.\n",
      "[07/21/2024-06:50:49] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.\n",
      "[07/21/2024-06:50:49] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.\n",
      "[07/21/2024-06:50:49] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.\n",
      "[07/21/2024-06:51:14] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.\n",
      "[07/21/2024-06:51:14] [TRT] [I] Detected 270 inputs and 1 output network tensors.\n",
      "[07/21/2024-06:51:16] [TRT] [I] Total Host Persistent Memory: 122176\n",
      "[07/21/2024-06:51:16] [TRT] [I] Total Device Persistent Memory: 0\n",
      "[07/21/2024-06:51:16] [TRT] [I] Total Scratch Memory: 218103808\n",
      "[07/21/2024-06:51:16] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 686 steps to complete.\n",
      "[07/21/2024-06:51:16] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 42.009ms to assign 18 blocks to 686 nodes requiring 1006638592 bytes.\n",
      "[07/21/2024-06:51:16] [TRT] [I] Total Activation Memory: 1006638592\n",
      "[07/21/2024-06:51:16] [TRT] [I] Total Weights Memory: 4020264580\n",
      "[07/21/2024-06:51:16] [TRT] [I] Engine generation completed in 26.9806 seconds.\n",
      "[07/21/2024-06:51:16] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 3834 MiB\n",
      "[07/21/2024-06:51:17] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 14672 MiB\n",
      "[07/21/2024-06:51:17] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:28\n",
      "[07/21/2024-06:51:17] [TRT] [I] Serialized 497 bytes of code generator cache.\n",
      "[07/21/2024-06:51:17] [TRT] [I] Serialized 773445 bytes of compilation cache.\n",
      "[07/21/2024-06:51:17] [TRT] [I] Serialized 18 timing cache entries\n",
      "[07/21/2024-06:51:17] [TRT-LLM] [I] Timing cache serialized to model.cache\n",
      "[07/21/2024-06:51:17] [TRT-LLM] [I] Serializing engine to phi3_mini_4k_instruct/trt_engines/fp8_lora/1-gpu/rank0.engine...\n",
      "[07/21/2024-06:51:23] [TRT-LLM] [I] Engine serialized. Total time: 00:00:05\n",
      "[07/21/2024-06:51:23] [TRT-LLM] [I] Total time of building all engines: 00:00:40\n"
     ]
    }
   ],
   "source": [
    "# Build TensorRT engine\n",
    "\n",
    "!trtllm-build --checkpoint_dir phi3_mini_4k_instruct/trt_ckpt/fp8/1-gpu \\\n",
    "             --output_dir phi3_mini_4k_instruct/trt_engines/fp8_lora/1-gpu \\\n",
    "             --gemm_plugin auto \\\n",
    "             --max_batch_size 8 \\\n",
    "             --max_input_len 1024 \\\n",
    "             --max_seq_len 2048 \\\n",
    "             --lora_plugin auto \\\n",
    "             --lora_dir ./Phi-3-mini-4k-instruct-ru-lora"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48cc7cce-7ff5-48d0-9933-f3111f03342c",
   "metadata": {},
   "source": [
    "## Run Inference for Q&A Task"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "abb0f26a-d10b-4a1c-be83-bce0378cd9ed",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[TensorRT-LLM] TensorRT-LLM version: 0.11.0\n",
      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.producer = {'name': 'modelopt', 'version': '0.13.1'}\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.residual_mlp = False\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.bias = False\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rotary_pct = 1.0\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rank = 0\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.decoder = phi3\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rmsnorm = True\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.lm_head_bias = False\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rotary_base = 10000.0\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.original_max_position_embeddings = 4096\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rotary_scaling = None\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set dtype to float16.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set bert_attention_plugin to auto.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set gpt_attention_plugin to auto.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set gemm_plugin to auto.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set identity_plugin to None.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set nccl_plugin to None.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set lookup_plugin to None.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set lora_plugin to auto.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set quantize_per_token_plugin to False.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set quantize_tensor_plugin to False.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set moe_plugin to auto.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set context_fmha to True.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set paged_kv_cache to True.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set remove_input_padding to True.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set use_custom_all_reduce to True.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set reduce_fusion to False.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set multi_block_mode to False.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set enable_xqa to True.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set tokens_per_block to 64.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set use_paged_context_fmha to False.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set use_fp8_context_fmha to False.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set multiple_profiles to False.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set paged_state to True.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [I] Set streamingllm to False.\n",
      "[07/21/2024-06:51:28] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored\n",
      "[07/21/2024-06:51:28] [TRT] [I] Loaded engine size: 3845 MiB\n",
      "[07/21/2024-06:51:29] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3834 (MiB)\n",
      "[07/21/2024-06:51:29] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3834 (MiB)\n",
      "[07/21/2024-06:51:29] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.\n",
      "[07/21/2024-06:51:30] [TRT-LLM] [I] Load engine takes: 2.5270047187805176 sec\n",
      "/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)\n",
      "  return _nested.nested_tensor(\n",
      "Input [Text 0]: \"<s><|user|> \\nCan you provide ways to eat combinations of bananas and dragonfruits?<|end|> \\n<|assistant|>\"\n",
      "Output [Text 0 Beam 0]: \"Certainly! Bananas and dragonfruits can be combined in various delicious ways. Here are some creative and healthy recipes for you to try:\n",
      "\n",
      "1. Banana and Dragonfruit Smoothie:\n",
      "   - Blend together 1 ripe banana, 1/2 cup of dragonfruit, 1/2 cup of Greek yogurt, and a handful of ice cubes. Add a splash of almond milk or water for desired consistency.\n",
      "\n",
      "2. Banana and Dragonfruit Salad:\n",
      "   - Slice a ripe banana and a dragonfruit into bite-sized pieces. Toss them together with mixed greens, sliced almonds, and a light vinaigrette dressing.\n",
      "\n",
      "3. Banana and Dragonfruit Salsa:\n",
      "   - Dice a ripe banana and a dragonfruit, and mix them with diced tomatoes, red onion, cilantro, lime juice, and a pinch of salt. Serve with tortilla chips or as a topping for grilled chicken or fish.\n",
      "\n",
      "4. Banana and Dragonfruit Wrap:\n",
      "   - Spread a layer of cream cheese or hummus on a whole wheat tortilla. Add sliced banana and dragonfruit, along with some shredded carrots, spinach, and avocado. Roll up the tortilla and enjoy!\n",
      "\n",
      "5. Banana and Dragonfruit Parfait:\n",
      "   - Layer sliced banana and dragonfruit with Greek yogurt, granola, and a drizzle of honey in a glass or jar. Repeat the layers until the glass is full, and enjoy!\n",
      "\n",
      "6. Banana and Dragonfruit Ice Cream:\n",
      "   - Blend together 2 ripe bananas, 1/2 cup of dragonfruit, 1/4 cup of coconut milk, and a splash of vanilla extract. Pour the mixture into a container and freeze for a few hours. Blend again until smooth, and enjoy your homemade banana and dragonfruit ice cream!\n",
      "\n",
      "7. Banana and Dragonfruit Oatme\"\n"
     ]
    }
   ],
   "source": [
    "# Run inference\n",
    "\n",
    "!python3 run.py --engine_dir phi3_mini_4k_instruct/trt_engines/fp8_lora/1-gpu \\\n",
    "                 --max_output_len 500 \\\n",
    "                 --tokenizer_dir ./Phi-3-mini-4k-instruct-ru-lora \\\n",
    "                 --input_text \"<|user|>\\nCan you provide ways to eat combinations of bananas and dragonfruits?<|end|>\\n<|assistant|>\" \\\n",
    "                 --lora_task_uids 0 \\\n",
    "                 --use_py_session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91cef4f2-a3a2-40eb-963f-fc395f455197",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
